Financial Contributions to Presidential Campaigns in California by Miguel Angel Nieto

Opening the data set

When I tried to open the data set with the normal read.csv function, I got the following error:

Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate ‘row.names’ are not allowed

I opened the .csv file with a text editor and counted the number of columns and values on each row. The number was correct in both cases, but there was something strange at the end of each row: a trailing comma. That extra comma gives each data row one more field than the header, so R takes the first column as row names and there is no one-to-one relationship between column names and values.

Using function parameters like row.names=NULL fixes the loading process, but it breaks the relationship between each column and the corresponding data. So I finally decided to use standard UNIX tools to fix the data set. This is what the original data looks like:

C00575795,"P00003392","Clinton, Hillary Rodham","KLEEMAN, ANNETTE","SANTA MONICA","CA","904021336","N/A","RETIRED",100,04-MAY-16,"","X","* HILLARY VICTORY FUND","SA18","1079219","C5503390","P2016",

And this is what we need:

C00575795,"P00003392","Clinton, Hillary Rodham","KLEEMAN, ANNETTE","SANTA MONICA","CA","904021336","N/A","RETIRED",100,04-MAY-16,"","X","* HILLARY VICTORY FUND","SA18","1079219","C5503390","P2016"

To fix the dataset I did the following:

$ cat P00000001-CA.csv | sed 's/.$//' > P00000001-CA-fixed.csv

P00000001-CA-fixed.csv is the file that I am going to use.
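The same clean-up can also be sketched in R itself. The snippet below is only an illustration on a tiny made-up sample (the `raw` lines are hypothetical), not the approach used in this report. It strips only a trailing comma, so a header line without one stays untouched. Note that `sed 's/.$//'` removes the last character of every line, header included, which is probably why the last column shows up as `election_t` in the structure output further down.

```r
# Hypothetical three-line sample mimicking the file layout:
# the header has no trailing comma, the data rows do.
raw <- c(
  'cmte_id,cand_nm,contb_receipt_amt,election_tp',
  'C00575795,"Clinton, Hillary Rodham",100,"P2016",',
  'C00577130,"Sanders, Bernard",40,"P2016",'
)

# Remove a trailing comma only; lines without one are left as they are.
fixed <- sub(",$", "", raw)

# Parse the cleaned text; every row now has as many fields as the header.
ca_sample <- read.csv(text = fixed, stringsAsFactors = FALSE)
str(ca_sample)
```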

Univariate Plots Section

These two graphs show the number of contributions received, grouped by different factors. In the first graph we can see that Clinton and Sanders are the two candidates with the largest number of contributions received. Both are Democrats, so the second graph doesn’t really surprise us: Democrats are the ones that receive the most contributions in California. Also, taking into account that they won the last two elections in that state, it shows that the party is really strong there.

Here I divided the people into different groups:

We see some trends. For example, Clinton seems to be the favourite of retired people, while in all other groups Sanders seems to be the winner in the number of contributions received. Clinton has already said that she is against the idea of raising the retirement age, but I don’t think that position sways retired people, since it doesn’t really affect them anymore. It is still interesting to see that retired people are the ones that don’t follow what seems to be the trend of contributing more often to Sanders.

Investigating the data by city, taking the 5 largest ones, we again see Democrats dominating the numbers, with Sanders again being the one with the most contributions received. At this point it looks pretty clear that California is a Democratic state.

Summary

  • Democrats are the clear winners in the number of contributions received in the whole state and also in each of the 5 biggest cities.
  • Among Democrats, Sanders is the winner in the number of contributions received, with Clinton in second place.
  • Clinton seems to be the favourite among retired people. All other groups contribute more often to Sanders.

Univariate Analysis

What is the structure of your dataset?

The dataset includes 19 variables (the 18 original ones plus the party variable I added) with 653397 observations. It includes contributions made by people to different candidates. People are defined by their name, the city where they live (and its zip code), amount of money given, employer, occupation, receipt date and some other variables used to identify the contribution itself.

## [1] 653397     19
## 'data.frame':    653397 obs. of  19 variables:
##  $ cmte_id          : Factor w/ 24 levels "C00458844","C00500587",..: 6 7 6 7 7 6 6 7 7 6 ...
##  $ cand_id          : Factor w/ 24 levels "P00003392","P20002671",..: 1 12 1 12 12 1 1 12 12 1 ...
##  $ cand_nm          : chr  "Clinton" "Sanders" "Clinton" "Sanders" ...
##  $ contbr_nm        : Factor w/ 111178 levels "& DREW BURKE, MELANIE",..: 52929 56715 33796 57448 57448 69571 4518 57462 57470 98984 ...
##  $ contbr_city      : Factor w/ 1488 levels "","*MORENO VALLEY",..: 1202 186 1399 1077 1077 226 1055 1412 1442 1269 ...
##  $ contbr_st        : Factor w/ 1 level "CA": 1 1 1 1 1 1 1 1 1 1 ...
##  $ contbr_zip       : Factor w/ 93931 levels "","00000","000090272",..: 12352 43904 43667 11331 11331 65865 24612 29931 40813 85363 ...
##  $ contbr_employer  : Factor w/ 37122 levels ""," APPLE INC.",..: 21803 2509 26638 35170 35170 21803 28897 23312 22578 30853 ...
##  $ contbr_occupation: Factor w/ 16930 levels ""," REAL ESTATE BROKER",..: 12721 14127 12295 10681 10681 12721 7535 11666 9870 11600 ...
##  $ contb_receipt_amt: num  100 40 80 35 100 ...
##  $ contb_receipt_dt : Factor w/ 518 levels "01-APR-15","01-APR-16",..: 62 60 273 76 93 207 256 60 76 288 ...
##  $ receipt_desc     : Factor w/ 73 levels "","* EARMARKED CONTRIBUTION: SEE BELOW REATTRIBUTION/REFUND PENDING",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_cd          : Factor w/ 2 levels "","X": 2 1 2 1 1 2 2 1 1 2 ...
##  $ memo_text        : Factor w/ 285 levels "","$1,500 REFUNDED ON 2/3/16.",..: 38 14 38 14 14 38 38 14 14 38 ...
##  $ form_tp          : Factor w/ 3 levels "SA17A","SA18",..: 2 1 2 1 1 2 2 1 1 2 ...
##  $ file_num         : int  1079219 1077404 1079219 1077404 1077404 1079219 1079219 1077404 1077404 1079219 ...
##  $ tran_id          : Factor w/ 650821 levels "A000771210424405B8CF",..: 161363 422120 163199 423564 425865 162306 166006 421582 423560 163924 ...
##  $ election_t       : Factor w/ 4 levels "","G2016","P2016",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ party            : Factor w/ 4 levels "democrat","republican",..: 1 1 1 1 1 1 1 1 1 1 ...

The summary of the dataset shows that Sanders is the candidate that received the largest number of contributions, but we still don’t know if that also means the largest amount of money. PENDERGAST, JAN is the person with the most contributions, 244 in total. Los Angeles is the city with the most contributions, and people who are not employed are the ones that contributed most often. The median contribution is $27 and the mean is $126. There are big outliers, for example negative contributions of -10000 (refunds) and positive ones of 10800, that we need to take into account.

##       cmte_id            cand_id         cand_nm         
##  C00577130:371470   P60007168:371470   Length:653397     
##  C00575795:163216   P00003392:163216   Class :character  
##  C00574624: 57129   P60006111: 57129   Mode  :character  
##  C00573519: 27342   P60005915: 27342                     
##  C00458844: 14089   P60006723: 14089                     
##  C00577312:  4696   P60007242:  4696                     
##  (Other)  : 15455   (Other)  : 15455                     
##               contbr_nm             contbr_city     contbr_st  
##  PENDERGAST, JAN   :   244   LOS ANGELES  : 48889   CA:653397  
##  MCLENNAN, MARLYN  :   238   SAN FRANCISCO: 43991              
##  AUSLENDER, LEONARD:   230   SAN DIEGO    : 23248              
##  WEIL, MONIQUE     :   224   OAKLAND      : 16394              
##  ISERI, MARTIN     :   212   SAN JOSE     : 15697              
##  SPEAR, JOSEPH     :   210   BERKELEY     : 12520              
##  (Other)           :652039   (Other)      :492658              
##      contbr_zip          contbr_employer      contbr_occupation 
##  926372766:   280   NONE         : 62886   NOT EMPLOYED: 99011  
##  916055507:   277   RETIRED      : 62531   RETIRED     : 96464  
##  950145153:   244   NOT EMPLOYED : 53310   TEACHER     : 15016  
##  946112036:   241   N/A          : 42497   ATTORNEY    : 14843  
##  921034727:   238   SELF EMPLOYED: 35884   ENGINEER    : 10008  
##  900695947:   231   (Other)      :395859   (Other)     :417961  
##  (Other)  :651886   NA's         :   430   NA's        :    94  
##  contb_receipt_amt   contb_receipt_dt 
##  Min.   :-10000.0   29-FEB-16: 11735  
##  1st Qu.:    15.0   31-MAR-16: 11506  
##  Median :    27.0   31-MAY-16:  9916  
##  Mean   :   126.1   30-APR-16:  9473  
##  3rd Qu.:    75.0   09-MAR-16:  8887  
##  Max.   : 10800.0   14-MAR-16:  8559  
##                     (Other)  :593321  
##                                   receipt_desc    memo_cd   
##                                         :642970    :628525  
##  Refund                                 :  4509   X: 24872  
##  REDESIGNATION FROM PRIMARY             :  1323             
##  REDESIGNATION TO GENERAL               :  1323             
##  REATTRIBUTION / REDESIGNATION REQUESTED:   569             
##  REATTRIBUTION TO SPOUSE                :   529             
##  (Other)                                :  2174             
##                                memo_text       form_tp      
##  * EARMARKED CONTRIBUTION: SEE BELOW:353637   SA17A:629287  
##                                     :272415   SA18 : 19601  
##  * HILLARY VICTORY FUND             : 19327   SB28A:  4509  
##  REDESIGNATION FROM PRIMARY         :  1323                 
##  REDESIGNATION TO GENERAL           :  1323                 
##  EARMARKED FROM MAKE DC LISTEN      :   858                 
##  (Other)                            :  4514                 
##     file_num                       tran_id       election_t    
##  Min.   :1003942   A5602AD777C8C4632B5A:     4        :   138  
##  1st Qu.:1066653   ADB49CB248C174E298F0:     4   G2016:  4958  
##  Median :1077404   A26C35A6066754130B99:     3   P2016:648294  
##  Mean   :1070263   A340DF85B7F884133A20:     3   P2020:     7  
##  3rd Qu.:1077665   A4E50E2DD07E4475996F:     3                 
##  Max.   :1079473   A7C22FA389E0348F98F0:     3                 
##                    (Other)             :653377                 
##          party       
##  democrat   :535561  
##  republican :117562  
##  green      :   197  
##  libertarian:    77  
##                      
##                      
## 
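Because refunds are recorded as negative amounts, it can help to look at them separately from the actual donations. A minimal sketch on made-up amounts (the vector `amt` is hypothetical, not taken from the data set) also shows why the median and the mean differ so much:

```r
# Hypothetical amounts: small donations, one large donation and one refund.
amt <- c(15, 27, 27, 75, 100, 10800, -10000)

# The median barely reacts to the extreme values, the mean does.
median(amt)
mean(amt)

# Split refunds (negative) from actual donations (positive).
refunds   <- amt[amt < 0]
donations <- amt[amt > 0]
summary(donations)
```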

What is/are the main feature(s) of interest in your dataset?

This dataset is all about money contributions made by people to different candidates. Usually different groups of people have different favourite candidates, depending on where they live and on their occupation (which affects yearly income). The main features here are the variables that describe people, the most interesting ones being employment status, the city where they live and the amount of money contributed. Most of the other variables are not really useful, like transaction ids, forms, file ids… so we are going to concentrate on those I have already described.

Money helps a lot to win elections. It gives the candidates the capacity to buy ads, spread propaganda and make their ideas reach a larger audience. It could be that the more money they receive, the higher their chances of winning. So the largest cities, and also people with more money, could decide who is going to be the winner. The question is, will the data show that trend?

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

In this first part I just counted the number of contributions, but that doesn’t show the full picture. Summing the total amount of money contributed per group will start to give us more information, showing which group contributes more to the winning candidate and which variables seem to have interesting outliers.

Did you create any new variables from existing variables in the dataset?

I created different dataframes to get data based on the variables I find most interesting:

  • biggest_five_cities dataframe, which includes data from the top five cities: Los Angeles, San Diego, San Jose, San Francisco and Fresno
  • not_employed dataframe for unemployed people
  • students dataframe for students
  • retired dataframe for retired people
  • other_people dataframe for all other employed people

The idea is to be able to use the data based on those particular variables without having to subset it every time I want to use it. That will make the code easier to write, read and maintain.
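A sketch of how such subsets could be built with `subset()`. The dataframe name `ca` appears in the output of this report, but the toy rows below and the exact label used for students (`"STUDENT"`) are assumptions made for illustration:

```r
# Toy stand-in for the real `ca` dataframe (values are made up).
ca <- data.frame(
  cand_nm           = c("Clinton", "Sanders", "Sanders", "Cruz", "Clinton"),
  contbr_city       = c("LOS ANGELES", "FRESNO", "BERKELEY", "FRESNO", "SAN JOSE"),
  contbr_occupation = c("RETIRED", "NOT EMPLOYED", "STUDENT", "ENGINEER", "TEACHER"),
  contb_receipt_amt = c(100, 27, 15, 250, 50),
  stringsAsFactors  = FALSE
)

top_cities <- c("LOS ANGELES", "SAN DIEGO", "SAN JOSE", "SAN FRANCISCO", "FRESNO")
special    <- c("NOT EMPLOYED", "STUDENT", "RETIRED")

# One dataframe per group, so later code does not need to subset again.
biggest_five_cities <- subset(ca, contbr_city %in% top_cities)
not_employed        <- subset(ca, contbr_occupation == "NOT EMPLOYED")
students            <- subset(ca, contbr_occupation == "STUDENT")
retired             <- subset(ca, contbr_occupation == "RETIRED")
other_people        <- subset(ca, !(contbr_occupation %in% special))
```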

I also created the “party” variable to specify which party each candidate belongs to.
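One way to derive such a variable is a lookup table from candidate to party. The sketch below covers only a few candidates and is an assumption about how the mapping could be written, not the code used for this report:

```r
# Partial, illustrative lookup: family name -> party.
party_of <- c(
  Clinton = "democrat",
  Sanders = "democrat",
  Cruz    = "republican",
  Rubio   = "republican",
  Stein   = "green",
  Johnson = "libertarian"
)

# Made-up sequence of candidate names, one per contribution.
cand_nm <- c("Clinton", "Sanders", "Cruz", "Stein", "Johnson", "Sanders")

# Index the named vector by candidate name; unname() drops the names again.
party <- factor(unname(party_of[cand_nm]),
                levels = c("democrat", "republican", "green", "libertarian"))
table(party)
```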

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Unusual distribution and interesting data

  • The two candidates receiving the largest number of contributions are Hillary Clinton and Bernard Sanders, both Democratic candidates.

  • If we keep investigating the number of contributions, we also see that retired people prefer Hillary Clinton while the other groups go with Bernard Sanders.

  • Those without employment, which usually means without much money to spend, are the ones that contributed most often.

  • California is a Democratic state.

Operations performed on the data

  • The cand_nm variable includes the full name. I used a regular expression to keep only the family name, which makes the graphs easier to read.
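For names stored as "Family, Given", as in the sample row at the top of this report, one possible regular expression is the following. This is a sketch; the exact expression used for the report is not shown, and the example names are only illustrative:

```r
full_names <- c("Clinton, Hillary Rodham", "Sanders, Bernard", "O'Malley, Martin Joseph")

# Drop everything from the first comma onwards, keeping the family name.
family_names <- sub(",.*$", "", full_names)
family_names
```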

Bivariate Plots Section

The first graph shows that most of the contributions are below $2500.

The next graphs tell a totally different story from the ones we saw in the previous section. Sanders was the one receiving the largest number of contributions, but if we count the total amount of money, Clinton is getting about twice as much money as Sanders.

On the other hand, there are no surprises in the last graph. Democrats were the ones receiving the most contributions, and we see here that they are also the winners in the total amount of money received.

Retired people again contribute to Clinton ahead of every other candidate. People who are not employed are the only ones that don’t follow this trend, most probably because Sanders proposed to:
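The difference between counting contributions and summing them can be sketched with `table()` and `tapply()`. The numbers below are made up purely to illustrate the effect:

```r
# Made-up contributions: many small ones for Sanders, few large ones for Clinton.
toy <- data.frame(
  cand_nm           = c(rep("Sanders", 6), rep("Clinton", 3)),
  contb_receipt_amt = c(27, 27, 15, 50, 10, 27, 500, 250, 100)
)

# Number of contributions per candidate.
n_contrib <- table(toy$cand_nm)

# Total amount of money per candidate.
total_amt <- tapply(toy$contb_receipt_amt, toy$cand_nm, sum)

n_contrib
total_amt
```

In this toy example Sanders has twice as many contributions, yet Clinton collects far more money, which is the same kind of reversal the graphs show.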

expand Social Security, to the tune of $65 more per month on average, financed by raising payroll taxes on wealthy wage earnings. He’s also vowed to get the unemployed working again through a $1 trillion infrastructure plan the campaign says will create 13 million “good-paying” jobs.

Source

Here we see even more interesting trends, which make our first analysis of the total number of contributions completely useless. In these 5 cities, Sanders received double the number of contributions. But if we count the money, as we do in the first graph, we see that Clinton gets more than twice as much money. She is the clear winner here. Democrats still get much more money (as they got many more contributions) than other parties.

If we check the graph per city, we also see big differences similar to the one explained before. In Los Angeles, San Diego, San Francisco and San Jose, the graphs are nearly the opposite of what we saw when we just counted the number of contributions. Clinton is again the winner in amount (what really matters), but not in the number of contributions.

The last thing we see is that Fresno goes in the opposite direction from the other 4 biggest cities. Republicans win in the amount of money contributed, and Sanders is the one getting the most money.

Summary

After adding a y variable to our graphs (total amount of money contributed) we see that the picture changed a lot.

  • Democrats are still the clear winners in the number of contributions received in the whole state, but not in Fresno.
  • Sanders was the winner in the number of contributions received, but Clinton clearly receives the largest total amount of money.
  • Retired people provided the largest number of contributions to Clinton, and we see here that they also provided the largest amount of money. Clinton is the winner in all groups except the “not employed” one. This is mostly the opposite of what we saw before.

Now things are even clearer. Democrats come first, and among Democrats Clinton is in first place.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

If we take the candidates into account, it varied a lot. It is clear that the number of contributions received doesn’t really correlate with the total amount of money. Sanders clearly gets the largest number of contributions, but Clinton gets the largest amount of money. Since both are Democrats, theirs is still the winning party in the state.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

People who are not employed prefer Sanders, because of his proposals. Also, Fresno seems to follow a different pattern from the other biggest cities and from the state of California as a whole. Fresno seems to be a Republican city.

What was the strongest relationship you found?

People who are not employed are, without doubt, supporting Sanders. It seems to be the only group where Sanders has done a really good job.

Multivariate Plots Section

The new layer, which adds colour depending on the party that receives the contribution, reinforces the idea that CA is mostly Democratic. Republicans’ numbers are pretty low, so it is difficult to get information from them. Let’s remove Democrats so we can have an easier-to-read picture of all the others.

Cruz and Rubio are the favourite ones among Republicans.

The mean line added shows that there are many outliers here, making it a bit useless without some adjustments first.

Limiting the y axis to the range from 0 to 1000 and adding an alpha of 1/100, we can get a better picture. Clinton and Sanders have such a high number of contributions that they are well above the mean. Both candidates are outliers in themselves. We see that people tend to contribute $100, $250, $500, $750 and $1000. Those are round numbers, so they appear often.
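The preference for round amounts can also be checked numerically by counting how often each amount occurs within the plotted range. A sketch on made-up values (the vector `amt` is hypothetical):

```r
# Hypothetical contribution amounts.
amt <- c(100, 27, 250, 100, 500, 15, 100, 250, 1000, 2700, 35, 250)

# Keep only amounts inside the plotted range, then count each distinct value.
in_range <- amt[amt >= 0 & amt <= 1000]
counts   <- sort(table(in_range), decreasing = TRUE)

# Most frequent amounts first.
head(counts, 3)
```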

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There is not much new data, since we have already explored most of the useful variables of the data set. Adding parties as a third variable in our graphs helps us to see the big difference between Democrats and Republicans in California. It is something we had already found, but it becomes even clearer now.

Were there any interesting or surprising interactions between features?

Not really. The data we have seen is what we expected after the previous analysis.


Final Plots and Summary

Plot One

Description One

The data we are analysing includes the amount of money each candidate has received from individual contributors in California. Therefore, the first graph shown here is meant to give a first overview of contributions in California. It shows the total amount of dollars received by each candidate.

Colors represent the party to which each candidate belongs. Clinton and Sanders, both Democrats, are the ones getting the most financial help. The difference between the two candidates is also pretty large, with Clinton receiving about twice as much money as Sanders.

Contributions received by each candidate

## ca$cand_nm: Bush
## [1] 3327044
## -------------------------------------------------------- 
## ca$cand_nm: Carson
## [1] 2952109
## -------------------------------------------------------- 
## ca$cand_nm: Christie
## [1] 456066
## -------------------------------------------------------- 
## ca$cand_nm: Clinton
## [1] 39364896
## -------------------------------------------------------- 
## ca$cand_nm: Cruz
## [1] 6283360
## -------------------------------------------------------- 
## ca$cand_nm: Fiorina
## [1] 1468489
## -------------------------------------------------------- 
## ca$cand_nm: Gilmore
## [1] 8100
## -------------------------------------------------------- 
## ca$cand_nm: Graham
## [1] 408595
## -------------------------------------------------------- 
## ca$cand_nm: Huckabee
## [1] 230890.6
## -------------------------------------------------------- 
## ca$cand_nm: Jindal
## [1] 23231.26
## -------------------------------------------------------- 
## ca$cand_nm: Johnson
## [1] 41187.6
## -------------------------------------------------------- 
## ca$cand_nm: Kasich
## [1] 1553824
## -------------------------------------------------------- 
## ca$cand_nm: Lessig
## [1] 186144.5
## -------------------------------------------------------- 
## ca$cand_nm: O'Malley
## [1] 297834.3
## -------------------------------------------------------- 
## ca$cand_nm: Pataki
## [1] 30450
## -------------------------------------------------------- 
## ca$cand_nm: Paul
## [1] 797624.3
## -------------------------------------------------------- 
## ca$cand_nm: Perry
## [1] 208400
## -------------------------------------------------------- 
## ca$cand_nm: Rubio
## [1] 4846484
## -------------------------------------------------------- 
## ca$cand_nm: Sanders
## [1] 18763935
## -------------------------------------------------------- 
## ca$cand_nm: Santorum
## [1] 36254.88
## -------------------------------------------------------- 
## ca$cand_nm: Stein
## [1] 27918
## -------------------------------------------------------- 
## ca$cand_nm: Trump
## [1] 501389.2
## -------------------------------------------------------- 
## ca$cand_nm: Walker
## [1] 495006.9
## -------------------------------------------------------- 
## ca$cand_nm: Webb
## [1] 76568.16

This graph also shows that the Democratic party is dominant in California, with both candidates acting as outliers in the graph. The numbers of the rest of the candidates are really low in comparison.

Number of contributions received by each party

## 
##    democrat  republican       green libertarian 
##      535561      117562         197          77

Even if the graph looks pretty straightforward, we cannot just extrapolate the data to every single combination and group of people, so in the next graphs I am going to show some extreme cases that actually show a completely different picture.

Plot Two

Description Two

The data set includes information about the employment of each contributor. This is a very important piece of information, because people tend to vote based on their personal situation. So it is pretty usual to see different groups supporting different candidates depending on their wealth or job.

I have divided the contributors into different groups:

  • Unemployed
  • Students
  • Retired
  • All others

The idea is to find some group that doesn’t follow the general picture seen in the previous graph. In this second picture we can see that unemployed people are the only group that prefers Sanders over Clinton. The difference is also very large.

Contributions received by Sanders in total from everybody:

## subset(ca, cand_nm == "Sanders")$cand_nm: Sanders
## [1] 18763935

Contributions received by Sanders in total from unemployed:

## subset(not_employed, cand_nm == "Sanders")$cand_nm: Sanders
## [1] 4954404

Sanders receives 26% of his contributions, measured in dollars, from this group. More information about why this could happen is in the Reflection section.
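The 26% figure follows directly from the two totals printed above. The variable names below are only illustrative:

```r
sanders_total      <- 18763935  # all contributions to Sanders, in dollars
sanders_unemployed <- 4954404   # contributions from the "not employed" group

# Share of Sanders' money that comes from the "not employed" group.
share <- sanders_unemployed / sanders_total
round(100 * share, 1)  # 26.4
```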

Plot Three

Description Three

Usually the biggest cities have the biggest impact, just because of their population. So I analysed the data from the 5 biggest cities in California and, in the same way I did in graph two, I tried to find something that goes against the ideas the first graph showed us.

In fact, we can see in this graph that Fresno is not Democratic. Republican candidates get most of the contributions, a totally different picture from the overall data of California. Actually, Fresno’s state and federal representation is mostly Republican, as we can see in the Wikipedia links presented in the next section.
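A comparison like this one boils down to summing the contributed amounts per city and party. The sketch below uses made-up rows (not the real Fresno figures) and `xtabs()` as one possible way to build that table:

```r
# Made-up contributions for two cities.
toy <- data.frame(
  contbr_city       = c("FRESNO", "FRESNO", "FRESNO", "SAN JOSE", "SAN JOSE"),
  party             = c("republican", "republican", "democrat", "democrat", "republican"),
  contb_receipt_amt = c(500, 250, 100, 400, 50)
)

# Sum of contributions for every city/party combination.
by_city_party <- xtabs(contb_receipt_amt ~ contbr_city + party, data = toy)
by_city_party

# Party receiving the most money in each city.
winner <- apply(by_city_party, 1, function(row) names(which.max(row)))
winner
```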


Reflection

I don’t live in the United States, so I started this project without background knowledge of its democratic system and without preconceived ideas. While analysing the data I learnt some very interesting information, summarized in the last three graphs of the previous section. I used Google to find news articles and Wikipedia information to confirm that what I was discovering just by checking the graphs matched reality.

It is really amazing to see how you can discover real-life facts just by graphing numbers. Checking the three selected graphs also tells us something really important: the full picture, the graph of the whole of California, can’t be used as a perfect representation of every single city and group of people. There will be some that don’t follow the general rule, each for different reasons.

I encountered some problems while working with the data set. As I mentioned, most of the variables are of little use and cannot be correlated with the other variables: zip codes, candidate ids, memo_cd, memo_text, file number and so on. So I had to investigate only a small subset of them. To get some more information, I divided employment into groups and added the party of each candidate. Still, there were not many combinations. Actually, taking into account that the topic is contributions to political campaigns, not much more information is needed apart from the money and some metadata about the donors.

I was able to extract the data I was interested in and I learnt a lot about California, the different candidates and each political party.

For the future, it would be even better if there were data about really big donations from enterprises. That would help us to see the usual relations between gas/oil/energy/technology companies and different candidates based on their ideas and future plans. Those big companies are the ones that really rule the world, and their contributions will change it in a more drastic way than individuals and their donations.